Abstract

Deep neural networks have recently achieved state of the art
performance thanks to new training algorithms for rapid parameter
estimation and new regularization methods to reduce overfitting.
However, in practice the network architecture has to be manually set
by domain experts, generally by a costly trial and error procedure,
which often accounts for a large portion of the final system
performance. We view this as a limitation and propose a novel training
algorithm that automatically optimizes network architecture, by
progressively increasing model complexity and then eliminating model
redundancy by selectively removing parameters at training time.

For convolutional neural networks, our method relies on iterative
split/merge clustering of convolutional kernels interleaved by
stochastic gradient descent. We present a training algorithm and
experimental results on three different vision tasks, showing improved
performance compared to similarly sized hand-crafted architectures.

1. Introduction 

Recently, deep neural networks (DNNs) have led to significant
improvement in several machine learning domains, from speech
recognition (Dahl et al., 2012) to computer vision (Krizhevsky et al.,
2012; Taigman et al., 2013) and machine translation (Sutskever et al.,
2014). DNNs have reached state of the art performance thanks to their
theoretically proven modeling and generalization capabilities (Hornik
et al., 1989; Hornik, 1991; Kurkova, 1992) and, in practice, thanks to
improvements in training algorithms for rapid parameter estimation
(Martens, 2010; Sutskever et al., 2013), novel regularization methods
to reduce overfitting (Srivastava et al., 2014), as well as ever
increasing data-sets (Deng et al., 2009) and powerful new computing
platforms (Chetlur et al., 2014). However, before parameter estimation
(so-called training) can begin, the DNN's structure (also called model
architecture) is usually manually defined by domain experts (Lin et
al., 2013), and can often account for a substantial portion of the
final system performance (Szegedy et al., 2014). We view this step as a
bottleneck in the current deep-learning pipeline, one that relies on a
trial-and-error, human-expert-in-the-loop approach which is, to say the
least, rather alchemic in nature. We want to address this basic
scalability issue of the deep learning development pipeline with
training methods that automatically search for DNN architectures while
jointly estimating model parameters.

While structural optimization is a notoriously difficult combinatorial
task, successful strategies were adopted in the past for (shallow)
models that motivated our approach. For instance, for Hidden Markov
Models with Gaussian mixture kernels, split/merge algorithms were used
to independently vary model complexity for each HMM state, resulting in
improved accuracy for large vocabulary speech recognition (Sankar,
1998). Information theoretic methods, such as the minimum description
length criterion, were also applied to the problem of structural
optimization (Barron et al., 1998), resulting in improved performance
in speech recognition (Shinoda & Watanabe, 2000) as well as training
algorithms for auto-encoders (Hinton & Zemel, 1994). However, to the
best of our knowledge, there is little published work on structural
optimization in the deep learning community, with the notable exception
of work based on empirical evaluation (Bergstra & Bengio, 2012) and
random search strategies (Bergstra & Bengio, 2012). More recently,
Bayesian optimization of hyper-parameters has also been introduced
(Snoek et al., 2012).

While these works are interesting, hyper-parameters are only one aspect
of the DNN structure, albeit one which is closely related to the
performance of the training algorithm. However, there are several other
structural parameters that strongly affect a DNN's performance and are
usually set by experimental trial and error, such as network depth and,
for convolutional models, the number of convolutional filters and the
kernel size for each layer. In our work, we aim to optimize model
architecture, specifically targeting convolutional neural networks
(CNNs), and optimizing complexity for each layer. Therefore, in our
approach, the model architecture is not kept constant during training;
instead, the model complexity is continuously optimized throughout the
training step (parameter estimation by stochastic gradient descent),
resulting, we believe, in a more scalable approach to the training of
deep neural networks. In section 2, we describe the general approach we
are taking to the problem of structure optimization of convolutional
neural networks. In section 2.1, we describe the theoretical
foundations of our approach. In section 3, we discuss data-sets and
experimental results, and in section 4 we discuss limitations and
possible future improvements.

2. Deep Clustered Convolutional Kernels 

The basic idea of our Deep Clustered Convolutional Kernels (DCCKs) is
a convolutional model architecture and an associated structural
training algorithm. We adopt a split/merge outer-loop to the training
process that first increases model capacity to model new factors of
variability seen in the data, then estimates new parameters for this
larger model by stochastic gradient descent (SGD), and finally reduces
model capacity to minimize model-space redundancy. Our approach takes
inspiration from previous work in the area of Gaussian kernel HMMs
(Sankar, 1998; Rigazio et al., 2000; Bocchieri & Mak, 2001; Lee et al.,
2001), and is philosophically based on the Occam's razor principle,
whereby a smaller model with similar performance on a given data-set is
likely to have better generalization capabilities to new unseen data.

An alternative view of our work may be in the context of recent
developments in DNN compression: (Ba & Caruana, 2014) shows that a
(shallow) DNN can approach the performance of a substantially larger
DNN when trained to mimic the logit output of the larger model.
Similarly, (Hinton et al., 2014) shows that logit-mimic training
(referred to as "Dark Knowledge") results in orders of magnitude
smaller models, compared to the initial complex ensemble models, yet
provides competitive performance when tested on both small tasks
(MNIST) and large scale industrial tasks (large vocabulary speech
recognition). It is important to notice that for both these works the
authors acknowledge that, while such smaller high performance models
can be obtained by logit-mimic training from a more complex model set,
thus showing that there is an optimal point in the parameter space with
high performance, there is currently no known training procedure to
directly


Algorithm 1 Deep Clustered Convolutional Kernels training algorithm

Input: Initial network architecture net with parameters $\lambda$,
noise variance $\sigma_n$ and jitter angle $\sigma_a$, stopping
conditions $\delta_0, \delta_1, \delta_2$ and mini-batch size $M$
while $\Delta$ Validation Accuracy > $\delta_0$ do
    while $\Delta$ Validation Accuracy > $\delta_1$ do
        // SPLIT
        $n_k$ = gaussianNoise($\sigma_n$)
        $\alpha_k$ = gaussianNoise($\sigma_a$)
        $\lambda_1$ = concat($\lambda$, $\lambda + n_k$)
        $\lambda$ = concat($\lambda_1$, rotate(kernels($\lambda$), $\alpha_k$))
        // FINETUNE
        while $\Delta$ Validation Accuracy > $\delta_2$ do
            runSGD($M$ minibatches)
        end while
    end while
    // MERGE
    centroid = Kmeans(kernels($\lambda$))
    $\lambda$ = nearest(kernels($\lambda$), centroid)
    while $\Delta$ Validation Accuracy > $\delta_2$ do
        runSGD($M$ minibatches)
    end while
end while


achieve such an optimal point in the smaller model. In this view, our
work can be seen as an attempt to find such an elusive point in
parameter space by systematically optimizing the DNN's structure to
eliminate redundancy and minimize the number of parameters, while at
the same time estimating the model parameters under the given loss
function. The main contribution of our work is a training methodology
to iteratively optimize the number of convolutional kernels while
estimating the convolutional filter parameters.
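
The control flow of Algorithm 1 can be sketched in Python as below. This is only a minimal sketch under stated assumptions, not the authors' implementation: `split`, `merge`, `run_sgd` and `val_acc` are hypothetical callbacks supplied by the caller, the `max_rounds` cap is added purely as a safety guard, and each loop body runs at least once because the accuracy gain is only known after a step has been taken.

```python
def finetune(model, run_sgd, val_acc, delta, max_rounds=100):
    """Run SGD rounds until the validation-accuracy gain drops to delta or below."""
    acc = val_acc(model)
    for _ in range(max_rounds):          # safety cap (assumption, not in Algorithm 1)
        model = run_sgd(model)           # one round of M mini-batches
        new_acc = val_acc(model)
        gain, acc = new_acc - acc, new_acc
        if gain <= delta:
            break
    return model, acc


def dcck_train(model, split, merge, run_sgd, val_acc,
               delta0, delta1, delta2, max_rounds=100):
    """Outer split/merge loop; all callbacks are hypothetical placeholders."""
    acc = val_acc(model)
    for _ in range(max_rounds):                      # outer loop (delta0)
        outer_start = acc
        for _ in range(max_rounds):                  # SPLIT + FINETUNE (delta1)
            before = acc
            model, acc = finetune(split(model), run_sgd, val_acc,
                                  delta2, max_rounds)
            if acc - before <= delta1:
                break
        # MERGE + FINETUNE
        model, acc = finetune(merge(model), run_sgd, val_acc,
                              delta2, max_rounds)
        if acc - outer_start <= delta0:
            break
    return model, acc
```

In a real setting `model` would hold the network weights, and `split` and `merge` would implement the kernel operations of Sections 2.1.1 and 2.1.2.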

2.1. Training algorithm 

Conceptually our training procedure is rather straightforward: starting
from an initial network architecture, we first train the model by SGD
until performance tops out on a validation set. Next, we increase the
model complexity of selected convolutional layers by splitting the
convolutional kernels. Splitting has the purpose of creating new
plausible convolutional filters given the current set of filters and
can be done by applying image pre-processing techniques to the kernels,
as well as adding jittering and noise to create enough variation. After
splitting, the model is again trained by SGD and possibly split again
until performance tops out. At that point the model is merged to reduce
redundancy in the parameter space and again trained by SGD. Notice that
the split/merge procedure can start at any layer but then has to
propagate upwards to change the number of kernels of the connecting
layers (fan-out). In our setup, given input data $x$, forward
propagation $f$ is done by:


$$f(x) = g(Wx + B) \qquad (1)$$

where $g$ is the ReLU activation function with $g(x) = \max(0, x)$,
$W$ is the weight parameters of the convolutional layer, and $B$ is the
biases, each with the following dimensions:

$$W_l = N_l \times \overbrace{d_l \times k_l \times k_l}^{P} \qquad (2)$$

$$B_l = 1 \times 1 \times 1 \times N_l \qquad (3)$$

where $l$ is a convolutional layer with $l \in \{1, \dots, L\}$, $N_l$
is the number of outputs of $l$, $d_l$ is the number of channels of
$l$, and $k_l$ is the size of the kernel used for $l$. We use square
convolutional kernels, so kernel dimensions are $k_l \times k_l$. For
simplicity, we define the sub-dimension of $W$ as $P$, as shown in
Equation 2. For the first convolutional layer we have $d_1 = 3$ for RGB
images and $d_1 = 1$ for gray-scale images. In the following
convolutional layers, $d$ is the number of outputs of the previous
convolutional layer; thus, $P$ would be the size of the feature
vectors. This implies that, when we perform the split/merge steps for
layer $l$, we need to update both $W_l$ and $B_l$ as well as
$W_{l+1}$. Biases for the following convolutional layer are
independent. An important caveat is that the optimal order of the
split/merge operations depends on the specific data-set and the filter
parameters. For instance, if the initial filters are sparse it is
beneficial to merge first. Otherwise, it is best to split first,
especially on smaller data-sets when the initial filters are already
compact and discriminative.
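
To make the bookkeeping implied by Equations 2 and 3 concrete, the following NumPy sketch changes the kernel set of one layer and propagates the change to the next layer. It is an illustration under assumptions: `reindex_layer` is a hypothetical helper, weights are stored as arrays of shape (N_l, d_l, k_l, k_l), biases are stored as flat vectors of length N_l, and the layer sizes are taken from the MNIST baseline only as an example.

```python
import numpy as np


def reindex_layer(W_l, B_l, W_next, keep):
    """Keep (or duplicate) the kernels of layer l whose indices are in `keep`.

    The input channels of layer l+1 are re-indexed the same way, because
    d_{l+1} must equal the new number of outputs of layer l.
    """
    keep = np.asarray(keep)
    W_new = W_l[keep]              # shape (N'_l, d_l, k_l, k_l)
    B_new = B_l[keep]              # shape (N'_l,)
    W_next_new = W_next[:, keep]   # shape (N_{l+1}, N'_l, k_{l+1}, k_{l+1})
    return W_new, B_new, W_next_new


rng = np.random.default_rng(0)
W1 = rng.normal(size=(100, 3, 5, 5))    # N_1 = 100, d_1 = 3, k_1 = 5
B1 = np.zeros(100)
W2 = rng.normal(size=(50, 100, 5, 5))   # d_2 = N_1 = 100

P = W1[0].size                          # sub-dimension P = d_1 * k_1 * k_1
W1m, B1m, W2m = reindex_layer(W1, B1, W2, keep=range(0, 100, 2))
print(P, W1m.shape, B1m.shape, W2m.shape)
```

The biases of layer l+1 are left untouched, in line with the remark that they are independent of the split/merge step on layer l.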

2.1.1. Splitting Kernels 

With splitting, we want to increase model complexity by creating new
convolutional kernels from the set of existing well-trained kernels.
Therefore, we create new kernels by selectively choosing from a fixed
set of transformations. The possible set of transformations to play
with is vast and includes the six isometries of the plane, angular
rotation, change in contrast (negative "reversing") and many others. In
our experiments, we focus on two transformations that seemed to provide
a consistent improvement:

• Rotation creates new kernels by rotating existing kernels in random
directions.

• Noise perturbation creates new kernels by adding 
Gaussian noise to the existing kernels. 
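
The two transformations can be sketched with NumPy as follows. This is a hedged illustration rather than the authors' code: `rotate_kernel` and `split_kernels` are hypothetical names, rotation uses simple nearest-neighbour resampling, the jitter angle is assumed to be in degrees, and `sigma_n` is used as a standard deviation. Following Algorithm 1 literally, the sketch appends both a noisy copy and a rotated copy of every kernel.

```python
import numpy as np


def rotate_kernel(kernel, angle_deg):
    """Rotate a (d, k, k) kernel about its centre (nearest-neighbour sampling)."""
    d, k, _ = kernel.shape
    c = (k - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:k, 0:k]
    # Inverse mapping: for every output pixel, locate its source pixel.
    src_x = np.cos(theta) * (xs - c) + np.sin(theta) * (ys - c) + c
    src_y = -np.sin(theta) * (xs - c) + np.cos(theta) * (ys - c) + c
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    inside = (sx >= 0) & (sx < k) & (sy >= 0) & (sy < k)
    out = np.zeros_like(kernel)          # pixels mapped from outside stay zero
    out[:, inside] = kernel[:, sy[inside], sx[inside]]
    return out


def split_kernels(W, sigma_n, sigma_a, rng):
    """Return the original kernels plus a noisy and a rotated copy of each."""
    noisy = W + rng.normal(0.0, sigma_n, size=W.shape)
    angles = rng.normal(0.0, sigma_a, size=W.shape[0])
    rotated = np.stack([rotate_kernel(w, a) for w, a in zip(W, angles)])
    return np.concatenate([W, noisy, rotated], axis=0)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 5, 5))
W_split = split_kernels(W, sigma_n=0.01, sigma_a=10.0, rng=rng)
print(W_split.shape)
```

Applying only one of the two transformations instead would double, rather than triple, the number of kernels.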

One important aspect we verified in our experiments is that rotating
kernels has a lower computational cost at training time than rotating
training images to create an augmented training set. Moreover, we
observed that rotating the filters can help improve robustness for
highly tilted object outliers, which would otherwise be hard to
classify correctly (see Figure 1). Adding random Gaussian noise, on the
other hand, has the obvious benefit of creating diversity and helping
with the SGD, as previously reported by (Srivastava et al., 2014).
Regarding the splitting strategy, we currently take the simplest
approach and split every kernel by a fixed amount. This is bound to be
locally suboptimal, and surely a better splitting strategy that tries
to maximize some diversity or discrimination criterion could be
devised, instead of indiscriminately splitting every single kernel.
However, for the most part, we observe that wasteful parameters created
by this simple splitting strategy will be eliminated during the final
merging step; therefore, aside from a potential sub-optimality in the
CPU/memory usage, we speculate the final model accuracy might not be
much affected by this uniform splitting strategy.



Figure 1. (a) Highly tilted, misclassified test image (b) Soft-max
output of original baseline model resulting in misclassification
(c) Baseline model convolutional kernels: notice high proportion
of redundant kernels (d) Soft-max at DCCKs intermediate training
stage, after split and fine-tuning (e) Final DCCK convolutional
kernels, after merge and fine-tuning, showing reduced redundancy
(f) Final DCCK soft-max output, correctly classifying the image


2.1.2. Merging Kernels 

After the splitting step, the model might have too much capacity and
thus part of the model might become over-parameterized, possibly
resulting in over-fitting and lower generalization power. Therefore the
merging step has the purpose of removing model space redundancies and
reducing model size, while maintaining the overall model accuracy. In
our algorithm we use k-means clustering to merge kernels since,
naturally, k-means minimizes cluster distortion under the defined
distortion measure (we employ the L2 norm to compute cluster
distortion). We empirically observe that k-means clustering to merge
filter maps is effective in reducing kernel redundancy (see the filters
in Figure 2). Then, we train the network and get the weight and bias
matrices from each convolutional layer to then choose the filters that
are nearest to each centroid. We update $W_l$ and $B_l$ with $W'_l$ and
$B'_l$, using k-means clustering to get centroids $C$ as:


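$$C = \arg\min_{C} \sum_{j=1}^{k} \sum_{p \in P_j} \lVert p - C_j \rVert^2 \qquad (4)$$

where $P_j$ denotes the kernels $p$ assigned to centroid $C_j$, and

$$W'_l = \begin{cases} [P'_1, \dots, P'_i, \dots, P'_k], & \text{or} \\ [C_1, \dots, C_i, \dots, C_k] \end{cases} \qquad (5)$$

where

$$P'_i = \arg\min_{P'} \lVert P' - C_i \rVert^2, \quad i = \{1, \dots, k\} \qquad (6)$$

and finally,

$$B'_l = \begin{cases} [B'_1, \dots, B'_i, \dots, B'_k], & \text{or} \\ [\beta_1, \dots, \beta_i, \dots, \beta_k] \end{cases} \qquad (7)$$

where $B'_i$ is the bias matched to $P'_i$, and $\beta_i$ is

$$\beta_i = \frac{\sum_{n} b_n}{\eta_i}, \quad n = \{1, \dots, N_i\} \qquad (8)$$

where $\eta_i$ is the number of $p$ in the group of centroid $C_i$.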

As shown in Equations 5 and 7, we explored two different methods to
update $W$ and $B$. The first method consists in choosing the $P_i$
that is closest to each centroid $C_i$. In this case, we use the bias
$B_i$ that corresponds to the selected $P_i$. The other way is to use
the centroid $C_i$ itself as filter parameters and update $B_l$ with
the average bias from each cluster (Equation 8). An important detail is
choosing the right value of k: if we choose k too small, then the
average cluster distortion will be too high to appropriately represent
the model parameters, possibly resulting in ineffective feature maps.
On the other hand, if we choose k too big, not enough kernels will be
merged. This, unfortunately, may very well be a hyper-parameter that
will have to be manually tuned. Table 2 shows k selected for each
experiment which gives the best results on our network models.
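
The two merge variants described above can be sketched with NumPy as follows. This is a minimal illustration under assumptions, not the authors' implementation: `kmeans` is a plain Lloyd's iteration with random initialization, `merge_kernels` is a hypothetical helper, biases are stored as a flat vector, and an empty cluster (if one ever occurs) falls back to a zero bias.

```python
import numpy as np


def kmeans(P, k, rng, iters=50):
    """Plain Lloyd's k-means on the rows of P (squared L2 distortion)."""
    C = P[rng.choice(len(P), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # leave empty clusters unchanged
                C[j] = P[labels == j].mean(axis=0)
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return C, d2.argmin(axis=1), d2


def merge_kernels(W, B, k, rng, use_centroids=False):
    """Reduce the N kernels in W (N, d, kh, kw) and biases B (N,) to k."""
    P = W.reshape(W.shape[0], -1)        # one row of length P per kernel
    C, labels, d2 = kmeans(P, k, rng)
    if use_centroids:
        # Second option of Equations 5 and 7: centroids and average biases.
        W_new = C
        B_new = np.array([B[labels == j].mean() if np.any(labels == j) else 0.0
                          for j in range(k)])
    else:
        # First option: the existing kernel nearest to each centroid.
        nearest = d2.argmin(axis=0)      # index of the closest kernel per centroid
        W_new = P[nearest]
        B_new = B[nearest]
    return W_new.reshape((k,) + W.shape[1:]), B_new


rng = np.random.default_rng(0)
W = rng.normal(size=(20, 3, 5, 5))
B = rng.normal(size=20)
W_near, B_near = merge_kernels(W, B, k=4, rng=rng)
W_cent, B_cent = merge_kernels(W, B, k=4, rng=rng, use_centroids=True)
print(W_near.shape, B_near.shape, W_cent.shape, B_cent.shape)
```

After either variant, the input channels of the following layer still have to be adapted, as discussed in Section 2.1.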


Figure 2. DCCK training example: starting from a large GTSRB model
with 150 convolutional kernels in the first layer, the algorithm first
merges them into 32 kernels. After fine-tuning, kernels are split by
adding noise and rotating, then fine-tuned one more time.


3. Experimental results 

Our experimental results are based on three different data-sets: MNIST,
the German Traffic Sign Recognition Benchmark (GTSRB), and CIFAR-10. To
make our experiments significant and to validate our approach, we
started from hand-tuned model architectures that were as close as
possible to the state of the art, in an effort to show that our
split/merge training procedure can still improve model architecture
even when starting from a very highly tuned architecture. Baseline
performance is reported in Table 1. For all experiments, we used the
BVLC Caffe C++ package (Jia et al., 2014). We started our experiments
from MNIST since the quick training time allowed us to quickly
determine a reasonable range of hyper-parameters, such as the number of
centroids k and the number of kernels for the split/merge procedure.
Next, we moved to a more realistic task, GTSRB, for which we started
from an initial model extremely close to the state of the art, and
finally confirmed the portability of our findings on the harder
CIFAR-10 data-set. We report the details of the experiments on each
data-set in the following sections.

MNIST     GTSRB1    GTSRB-3DNN    CIFAR-10
0.82%     2.44%     1.24%         10.4%

Table 1. Baseline models performance on the three data-sets selected
for our experiments.


Layer              # of maps    Kernel
Input              3
Convolutional      100          5x5
Max Pooling        100          2x2
Convolutional      50           5x5
Max Pooling        50           2x2
Fully connected    100          1x1
Fully connected    10           1x1

Table 2. MNIST baseline architecture


3.1. MNIST results 

The MNIST data-set contains 60,000 training images and 10,000 testing
images of hand-written digits of size 28x28. The baseline model is
composed of two convolutional layers and two fully-connected layers, as
shown in Table 2, with ReLU and pooling following each convolutional
layer. This simple network achieves a 0.82% error rate. The DCCKs
training algorithm begins by splitting the first convolutional layer
from 100 to 200 kernels; after the subsequent fine-tuning the model
achieved a 0.58% error rate (Table 3), which is almost a 30% relative
improvement over the original model. This compares favorably to a
200-kernel model trained from scratch, which achieves 0.78%, and even a
300-kernel model trained from scratch, which achieves 0.75%. This
suggests that splitting filters has the potential to help the following
SGD-based fine-tuning reach an optimal point which generalizes better.
Also, more importantly, after the following merging step, back to 100
kernels, the performance dropped by only 0.01% to an error rate of
0.59%.
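
As a quick check of the relative-improvement figure, the arithmetic can be reproduced from the error rates listed in Table 3. The helper name `relative_improvement` is only illustrative.

```python
def relative_improvement(baseline_err, new_err):
    """Fraction of the baseline error removed by the new model."""
    return (baseline_err - new_err) / baseline_err


# Error rates (%) from Table 3: baseline [1], split [4], merge [5].
after_split = 100 * relative_improvement(0.82, 0.58)
after_merge = 100 * relative_improvement(0.82, 0.59)
print(round(after_split, 1), round(after_merge, 1))  # 29.3 28.0
```

The merged model therefore keeps roughly 28% of relative improvement while using the same number of first-layer kernels as the baseline.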

3.2. GTSRB results 

The GTSRB data-set contains 39,209 training images and 12,630 testing
images of various sizes, with 43 different classes consisting of
standard traffic signs from Germany (Houben et al., 2013). First, we
resized all images to 48x48 and then applied pre-processing techniques
such as histogram equalization, adaptive histogram equalization, and
contrast normalization. For this task, we have two initial networks: a
single-model baseline, GTSRB1, consisting of three convolutional and
two fully connected layers and reaching a 2.44% error rate, and a
larger state of the art ensemble model, GTSRB-3DNN (Table 4), inspired
by MCDNN (Ciresan et al., 2012), and reaching a 1.24% error


No.   STAGE             CONV1   CONV2   Err(%)
1     ORIGINAL          100     50      0.82
2     ORIGINAL          200     50      0.78
3     ORIGINAL          300     50      0.75
4     SPLIT FROM [1]    200     50      0.58
5     MERGE FROM [4]    100     50      0.59

Table 3. MNIST error rate after fine-tuning. Notice that clustering
was performed on the first convolutional layer only.


Layer              # of Maps    Kernel (GTSRB-3DNN)
Input              3
Convolutional      150          3x3, 3x3, 3x3
Max Pooling        150          2x2, 2x2, 2x2
Convolutional      150          4x4, 4x4, 2x2
Max Pooling        150          2x2, 2x2, 2x2
Convolutional      250          4x4, 4x4, 2x2
Max Pooling        250          2x2, 2x2, 2x2
Fully Connected    500          1x1, 1x1, 1x1
Fully Connected    43           1x1, 1x1, 1x1

Table 4. GTSRB-3DNN architecture


rate, which is within 0.2% of the best published result. We remark
that the ensemble models use different input sizes of 48x48 pixels,
38x48 pixels and 28x48 pixels: because of this, we expected a high
degree of redundancy in the GTSRB-3DNN kernels which may be
successfully exploited by the DCCKs merging step. Indeed, by visually
inspecting the lower convolutional layers we could easily identify an
abundant amount of redundancy (see Figure 2). Because of this highly
redundant structure in the initial model, we inverted the sequence of
our training procedure to first merge kernels instead of splitting,
which maintains the accuracy and provides a significantly faster model
(see Table 8).
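
The speed gain from merging can be quantified directly from the forward-pass timings reported in Table 8. The snippet below is simple arithmetic on those published numbers; the dictionary layout is only illustrative.

```python
# Forward-pass times in ms, copied from Table 8 as (original, merged).
times_ms = {
    "SIMPLE (GTSRB1)": (14.8, 12.6),   # rows 1 and 3
    "3-DNNs": (27.9, 19.4),            # rows 4 and 5
}

speedup = {name: orig / merged for name, (orig, merged) in times_ms.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x faster")
```

The ensemble benefits more than the single model, which is consistent with the higher redundancy expected in its kernels.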

Furthermore, the specific structure of the traffic signs led to some
peculiar behaviors on this database: for instance, kernel rotation
especially helped improve performance. A detailed inspection of the
recognition errors highlighted that several traffic signs that were
misclassified by the baseline model were highly tilted; such instances
were mostly recovered and correctly recognized after DCCKs training
(see Figure 1 for one example of such an instance). We also remark that
using centroids as new kernels resulted in better gains on this
data-set.

Table 5 and Table 6 show the experimental results. We remark that in
almost all cases we either achieve significantly better performance, or
similar performance with significantly reduced model size. One
exception worth noticing is [5] in Table 5, which shows the worst
performance of all experiments: in this case we merged the last
convolutional layer, which is fully connected to the first fully
connected layer of this network architecture. We speculate this issue
is due to the fact that it is notoriously hard to optimize the
parameters of fully connected layers: splitting or merging a
convolutional layer which fans out into a fully connected layer has the
potential to harm the parameter structure to a point where SGD cannot
easily recover.

No.   STAGE        CONV1   CONV2   CONV3   Err(%)
1     ORIGINAL     150     150     250     2.44
2     MERGE [1]    32      150     250     2.34
3     MERGE [2]    32      32      250     2.7
4     MERGE [2]    32      64      250     2.36
5     MERGE [3]    32      32      32      3.82
6     SPLIT [2]    64      150     250     2.5
7N    SPLIT [3]    32      64      250     2.25
8R    SPLIT [3]    32      64      250     2.15
9     SPLIT [1]    300     150     250     2.24
10    MERGE [1]    40      150     250     2.31
11    SPLIT [1]    150     300     250     2.27

Table 5. GTSRB1 baseline model experiments. 'R' denotes 'Rotation',
and 'N' denotes 'Noise perturbation'. Remark that [8R], which splits
both the first and the second convolutional layer, followed by a merge
of the second layer, achieved the best performance. Instead, [5], which
merges the last convolutional layer, had a performance drop; we
speculate that this is due to difficulty in optimizing the following
fully connected layer.

No.   STAGE        CONV1   CONV2   CONV3   Err(%)
1     ORIGINAL     150     150     250     1.24
2     ORIGINAL     16      150     250     1.67
3     MERGE [1]    32      150     250     1.18
4     MERGE [1]    16      150     250     1.25
5     SPLIT [1]    300     150     250     1.21
6     SPLIT [3]    64      150     250     1.15

Table 6. Results table for DCCK trained from the state of the art
GTSRB-3DNN initial model, showing a small but significant improvement.

3.3. CIFAR10 results 

The CIFAR-10 data-set consists of 50,000 training and 10,000 testing
images. Each image is 32x32 pixels and represents a class of naturally
occurring objects. To develop the CIFAR-10 baseline we used the same
techniques discussed in (Goodfellow et al., 2013) and the
Network-In-Network (Lin et al., 2013) model, which achieves a baseline
10.4% error rate, which is within reasonable distance from the state of
the art. When we apply DCCKs training on the CIFAR-10 data-set, the
increase in performance is not as large as on the previous data-sets
but it is still significant and consistent. We believe that this is
because the highly successful, highly (manually) optimized
Network-In-Network architecture makes it harder for the automatically
devised DCCKs training to provide a large improvement. Therefore these
results should demonstrate that DCCKs may still provide some
improvement even when applied on top of more complex, highly tuned
architectures, while keeping the number of parameters under control.
Additionally we show that by splitting layers and doubling the number
of parameters we could achieve an additional 0.5% average error rate
improvement.

No.   STAGE        CONV1   CONV2   CONV3   Err(%)
1     ORIGINAL     192     192     192     10.4
2     SPLIT [1]    384     192     192     10.29
3     SPLIT [1]    576     192     192     10.25
4     MERGE [3]    192     192     192     10.2
5     SPLIT [1]    192     192     384     10.04
6     SPLIT [1]    192     384     192     10.04
7     MERGE [6]    192     192     192     10.28

Table 7. Result table for CIFAR-10.



Figure 3. Test-set accuracy of GTSRB1 (simple) network during
fine-tuning. Notice that the "original" and "merge" models have the
same number of parameters, but the optimized DCCK architecture shows
better accuracy throughout epochs.

Figure 4. Test-set loss of GTSRB1 (simple) network during fine-tuning.
Notice that the "original" and "merge" models have the same number of
parameters, but the optimized DCCK architecture shows better accuracy
throughout epochs.


4. Discussion 

In this work, we introduced the concept of DCCKs and a training
procedure whereby convolutional kernels learned by SGD can be
effectively split and merged. Experimental results confirmed that this
process results in gradually improving performance, while the training
algorithm jointly optimizes the structure as well as the model's
parameters. Results show that DCCKs can make parsimonious use of model
capacity by converging towards the minimal number of parameters that
gives the best performance, even when starting with a highly manually
optimized network architecture. Figure 3 shows training and validation
data loss over fine-tune epochs; the "original" and the "merge" curves
refer to training and generalization loss for models having the same
number of parameters; notice how the "merge" curve is consistently
above the "original" curve, apparently providing an upper-bound to the
loss, and thus empirically confirming that the DCCKs architecture was
indeed improved by the training algorithm. Moreover, in some
experiments, DCCKs resulted in significantly higher performance with a
smaller number of parameters than the original model. On the other
hand, DCCKs showed bigger gains on simpler databases, such as MNIST and
GTSRB, than on the more complex CIFAR-10 data-set and the more complex
Network-In-Network model architecture. This is however


Model        STAGE        CONV1   CONV2   SPEED (MS)
1. SIMPLE    ORIGINAL     150     150     14.8
2. SIMPLE    MERGE [1]    32      150     14.1
3. SIMPLE    MERGE [2]    32      64      12.6
4. 3-DNNs    ORIGINAL     150     150     27.9
5. 3-DNNs    MERGE [4]    32      150     19.4

Table 8. Speed comparisons for GTSRB1 and GTSRB-3DNNs models and their
corresponding DCCK trained models. Test time of forward-pass only with
minibatches of 10 48x48 pixel images on nVidia GeForce GTX 770.

to be expected, especially because the NIN architecture is extremely
well tuned and achieves very high performance to begin with, so it is
natural to expect smaller gains from our automatic structure
optimization procedure. Besides the obvious advantage of automatic
structure optimization, a side benefit of DCCKs training is that
manipulating kernels takes less computation than pre-processing
training data, which makes DCCKs optimization more efficient.

To conclude, we believe there are several aspects of the DCCKs training
algorithm that could be improved. As we mentioned in 2.1.1, currently
all kernels are split by the same amount. However, one could argue that
some kernels might be better than others and should be replicated
first, possibly based on their ability to provide new discriminative
features. If we could determine such kernels we could potentially
improve training speed, though the final accuracy after the merge step
might not be impacted as much. Finding a more extensive set of kernel
transformations to achieve a highly selective split step would also be
an appropriate next step, as well as comparison and combination with
logit-mimic training and model compression techniques (Ba & Caruana,
2014; Hinton et al., 2014). Ultimately, as for any new methodology in
the deep-learning sector, it would be very important to test how well
DCCKs scale to higher dimensional, larger problems, such as IMAGE-NET,
and to different non-vision tasks such as speech recognition or
language modeling.